COMPUTER-IMPLEMENTED METHOD, DISTRIBUTED HARDWARE TRACKING SYSTEM, AND STORAGE UNIT
Patent Abstract:
A computer-implemented method executed by one or more processors, the method including monitoring the execution of program code executed by a first processor component, and monitoring the execution of program code executed by a second processor component. A computing system stores data identifying hardware events in a memory buffer, the stored events occurring across processor units that include at least the first and second processor components. Each of the hardware events includes an event timestamp and metadata characterizing the event. The system generates a data structure that identifies the hardware events. The data structure arranges the events in a time-ordered sequence and associates the events with at least the first or second processor component. The system stores the data structure in a memory bank of a main device and uses the data structure to analyze the performance of the program code executed by the first or second processor component.
Publication number: BR112019015271B1
Application number: R112019015271-7
Filing date: 2017-10-20
Publication date: 2021-06-15
Inventors: Thomas Norrie; Naveen Kumar
Applicant: Google Llc
Main IPC classification:
Patent Description:
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application is related to U.S. Patent Application No. 15/472,932, entitled "Synchronous Hardware Event Collection", filed March 29, 2017, and having Attorney Docket Number 16113-8129001. The entire disclosure of U.S. Patent Application No. 15/472,932 is expressly incorporated herein by reference in its entirety.
BACKGROUND
[0002] This specification relates to analyzing the execution of program code.
[0003] Effective performance analysis of distributed software running on distributed hardware components can be a complex task. Distributed hardware components can be respective processor cores of two or more central processing units (CPUs) (or graphics processing units (GPUs)) that cooperate and interact to execute larger portions of a software program or program code.
[0004] From a hardware perspective (e.g., on CPUs or GPUs), there are generally two types of information or resources available for performance analysis: 1) hardware performance counters and 2) hardware event traces.
SUMMARY
[0005] In general, one aspect of the subject matter described in this specification can be embodied in a computer-implemented method executed by one or more processors. The method includes monitoring the execution of program code by a first processor component, the first processor component being configured to execute at least a first portion of the program code; and monitoring the execution of the program code by a second processor component, the second processor component being configured to execute at least a second portion of the program code.
[0006] The method further includes storing, by the computing system and in at least one memory buffer, data identifying one or more hardware events occurring across processor units that include the first processor component and the second processor component. Each hardware event represents at least one of a data communication associated with a memory access operation of the program code, an instruction issued by the program code, or an instruction executed by the program code. The data identifying each of the one or more hardware events includes a hardware event timestamp and metadata characterizing the hardware event. The method includes generating, by the computing system, a data structure that identifies the one or more hardware events, the data structure being configured to arrange the one or more hardware events in a time-ordered sequence of events that are associated with at least the first processor component and the second processor component.
[0007] The method includes storing, by the computing system, the generated data structure in a memory bank of a main device for use in analyzing the performance of the program code being executed by at least the first processor component or the second processor component.
[0008] These and other implementations may each optionally include one or more of the following features. For example, in some implementations, the method further includes: detecting, by the computing system, a trigger function associated with portions of the program code being executed by at least one of the first processor component or the second processor component; and, in response to detection of the trigger function, initiating, by the computing system, at least one tracking event that causes data associated with one or more hardware events to be stored in at least one memory buffer.
[0009] In some implementations, the trigger function corresponds to at least one of a particular sequence step in the program code or a particular time parameter indicated by a global time clock used by the processor units; and initiating at least one tracking event includes determining that a tracking bit is set to a particular value, the at least one tracking event being associated with a memory access operation including multiple intermediate operations that occur across the processor units, and data associated with the multiple intermediate operations is stored in one or more memory buffers in response to a determination that the tracking bit is set to the particular value.
[0010] In some implementations, storing data identifying one or more hardware events further includes: storing, in a first memory buffer of the first processor component, a first subset of data identifying hardware events of the one or more hardware events. Storage occurs in response to the first processor component executing a hardware trace instruction associated with at least the first portion of the program code.
[0011] In some implementations, storing data identifying one or more hardware events further includes: storing, in a second memory buffer of the second processor component, a second subset of data identifying hardware events of the one or more hardware events. Storage occurs in response to the second processor component executing a hardware trace instruction associated with at least the second portion of the program code.
[0012] In some implementations, generating the data structure further includes: comparing, by the computing system, at least hardware event timestamps of respective events in the first data subset identifying hardware events with at least hardware event timestamps of respective events in the second data subset identifying hardware events; and providing, by the computing system and for presentation in the data structure, a correlated set of hardware events based, in part, on the comparison between the respective events in the first subset and the respective events in the second subset.
[0013] In some implementations, the generated data structure identifies at least one parameter that indicates a latency attribute of a particular hardware event, the latency attribute indicating at least a duration of the particular hardware event. In some implementations, at least one processor of the computing system is a multi-node, multi-core processor having one or more processor components, and one or more hardware events correspond, in part, to data transfers that occur between at least the first processor component of a first node and the second processor component of a second node.
[0014] In some implementations, the first processor component and the second processor component are each one of: a processor, a processor core, a memory access mechanism, or a hardware resource of the computing system, and the one or more hardware events correspond, in part, to a movement of data packets between a source and a destination; and the metadata characterizing the hardware event corresponds to at least one of a source memory address, a destination memory address, a unique trace identification number, or a size parameter associated with a direct memory access (DMA) trace.
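By way of a non-limiting illustration only, the following C++ sketch shows one possible host-side representation of a single stored hardware event carrying the timestamp and the metadata categories listed in paragraphs [0006] and [0014]. The structure, field names, and comparator are assumptions introduced for illustration; they do not define a hardware layout or an application programming interface of the described system.

```cpp
#include <cstdint>
#include <string>

// Illustrative sketch only: one stored hardware event, holding the event
// timestamp and the metadata categories named in paragraph [0014].
// Field names and widths are assumptions, not a defined hardware layout.
struct TraceEvent {
  uint64_t timestamp = 0;    // hardware event timestamp (e.g., a GTC value)
  uint32_t trace_id = 0;     // unique trace identification number
  uint64_t source_addr = 0;  // source memory address of the data movement
  uint64_t dest_addr = 0;    // destination memory address of the data movement
  uint32_t size_bytes = 0;   // size parameter associated with a DMA trace
  std::string component;     // originating component (e.g., "FPC 104", "SPC 106")
};

// Comparator used when arranging events into a time-ordered sequence.
inline bool EarlierEvent(const TraceEvent& a, const TraceEvent& b) {
  return a.timestamp < b.timestamp;
}
```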
[0015] In some implementations, a particular trace identification (ID) number is associated with multiple hardware events that occur across the processor units, the multiple hardware events corresponding to a particular memory access operation, and the particular trace ID number is used to correlate one or more hardware events of the multiple hardware events and to determine a latency attribute of the memory access operation based on the correlation.
[0016] Another aspect of the subject matter described in this specification can be embodied in a distributed hardware tracking system that includes: one or more processors including one or more processor cores; and one or more machine-readable storage units for storing instructions that are executable by the one or more processors to perform operations including: monitoring the execution of program code by a first processor component, the first processor component being configured to execute at least a first portion of the program code; and monitoring the execution of the program code by a second processor component, the second processor component being configured to execute at least a second portion of the program code.
[0017] The operations further include storing, by the computing system, data identifying one or more hardware events occurring across processor units that include the first processor component and the second processor component. Each hardware event represents at least one of a data communication associated with a memory access operation of the program code, an instruction issued by the program code, or an instruction executed by the program code. The operations include generating, by the computing system, a data structure that identifies the one or more hardware events, the data structure being configured to arrange the one or more hardware events in a time-ordered sequence of events that are associated with at least the first processor component and the second processor component.
[0018] The operations further include storing, by the computing system, the generated data structure in a memory bank of a main device for use in the performance analysis of the program code being executed by at least the first processor component or the second processor component.
[0019] Other implementations of this and other aspects include corresponding systems, apparatus, and computer programs configured to perform the actions of the methods, encoded on computer storage devices. A system of one or more computers can be so configured by virtue of software, firmware, hardware, or a combination thereof installed on the system that, in operation, causes the system to perform the actions. One or more computer programs can be so configured by virtue of having instructions that, when executed by a data processing apparatus, cause the apparatus to perform the actions.
[0020] The subject matter described in this specification can be implemented in particular embodiments so as to realize one or more of the following advantages. The described hardware tracking systems allow efficient correlation of hardware events that occur during execution of a software program distributed across distributed processing units that include multi-core, multi-node processors. The described hardware tracking system further includes mechanisms that allow collection and correlation of hardware events / tracking data in multiple cross-node configurations.
[0021] The hardware tracking system improves computational efficiency through the use of dynamic triggers that are executed via hardware knobs / features. Furthermore, hardware events can be time-ordered in a sequenced manner with event descriptors such as unique tracking identifiers, event timestamps, event source addresses, and event destination addresses. These descriptors help software programmers and processor design engineers with effective debugging and analysis of software and hardware performance issues that may arise during source code execution.
[0022] Details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other potential features, aspects, and advantages of the subject matter will become evident from the description, the drawings, and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0023] Figure 1 is a block diagram of an example computing system for distributed hardware tracing.
[0024] Figure 2 illustrates a block diagram of trace chains and respective nodes of an example computing system for a distributed hardware trace.
[0025] Figure 3 illustrates a block diagram of an example trace multiplexer design architecture and an example data structure.
[0026] Figure 4 is a block diagram indicating a trace activity for a direct memory access trace event performed by an example computing system for a distributed hardware trace.
[0027] Figure 5 is a process flowchart of an example process for distributed hardware tracing.
[0028] Like reference numbers and designations in the various drawings indicate like elements.
DETAILED DESCRIPTION
[0029] The subject matter described in this specification generally relates to a distributed hardware trace. In particular, a computing system monitors the execution of program code executed by one or more processor cores. For example, the computing system can monitor the execution of program code executed by a first processor core and the execution of program code executed by at least a second processor core. The computing system stores data identifying one or more hardware events in a memory buffer. The stored data identifying the events corresponds to events that occur across distributed processor units that include at least the first and second processor cores.
[0030] For each hardware event, the stored data includes an event timestamp and metadata characterizing the hardware event. The system generates a data structure that identifies the hardware events. The data structure arranges the events in a time-ordered sequence and associates the events with at least the first or second processor core. The system stores the data structure in a memory bank of a main device and uses the data structure to analyze the performance of the program code executed by the first or second processor core.
[0031] Figure 1 illustrates a block diagram of an example computing system 100 for a distributed hardware trace. As used in this specification, a distributed hardware system trace is a store of data identifying events that occur in components and subcomponents of an example processor microchip. Also, as used herein, a distributed hardware system (or tracking system) corresponds to a collection of processor microchips or processing units that cooperate to execute respective portions of a software program or program code configured for distributed execution across the collection of processor microchips or distributed processing units.
[0032] System 100 may be a distributed processing system having one or more processors or processing units that execute a software program in a distributed manner, that is, by executing different parts or portions of the program code on different processing units of system 100. The processing units may include two or more processors, processor microchips, or processing units, for example, at least a first processing unit and a second processing unit.
[0033] In some implementations, two or more processing units may be distributed processing units when the first processing unit receives and executes a first portion of program code of a distributed software program, and when the second processing unit receives and executes a second portion of program code of the same distributed software program.
[0034] In some implementations, distinct processor chips of system 100 may form respective nodes of the distributed hardware system. In alternative implementations, a single processor chip may include one or more processor cores and hardware resources that may each form respective nodes of the processor chip.
[0035] For example, in the context of a central processing unit (CPU), a processor chip can include at least two nodes, and each node can be a respective CPU core. Alternatively, in the context of a graphics processing unit (GPU), a processor chip may include at least two nodes, and each node may be a respective GPU streaming multiprocessor. Computing system 100 can include multiple processor components. In some implementations, the processor components can be at least one of a processor chip, a processor core, a memory access mechanism, or at least one hardware component of the overall computing system 100.
[0036] In some cases, a processor component, such as a processor core, may be a fixed-function component configured to execute at least one specific operation based on at least one instruction issued by the executing program code. In other cases, a processor component, such as a memory access engine (MAE), may be configured to execute program code at a lower level of detail or granularity than the program code executed by other processor components of system 100.
[0037] For example, program code executed by a processor core can cause an MAE descriptor to be generated and transmitted / sent to the MAE. Upon receipt of the descriptor, the MAE can perform a data transfer operation based on the MAE descriptor. In some implementations, data transfers performed by the MAE may include, for example, moving data to and from certain components of system 100 through certain data paths or system interface components, or issuing data requests on an example configuration bus of system 100.
[0038] In some implementations, each tensor node of an example processor chip of system 100 may have at least two "front ends", which may be hardware blocks / resources that process program instructions. As discussed in more detail below, a first front end may correspond to a first processor core 104, while a second front end may correspond to a second processor core 106. Thus, the first and second processor cores may also be described here as the first front end 104 and the second front end 106.
[0039] As used in this specification, a trace chain can be a specific physical data communication bus to which trace entries can be posted for transmission to an example chip manager in system 100.
Trace entries received can be words / data structures including multiple bytes and multiple binary values or digits. Thus, the term "word" indicates a fixed-size piece of binary data that can be manipulated as a unit by the hardware devices of an example processor core.
[0040] In some implementations, the processor chips of the distributed hardware tracking system are multi-core processors (that is, having multiple cores) that each execute portions of program code on respective cores of the chip. In some implementations, the program code portions may correspond to vector computations for inference workloads of an example multilayer neural network. In alternative implementations, the program code portions may generally correspond to software modules associated with conventional programming languages.
[0041] The computing system 100 generally includes a node manager 102, a first processor core (FPC) 104, a second processor core (SPC) 106, a node structure (NF) 110, a data router 112, and a main interface block (HIB) 114. In some implementations, system 100 may include a memory multiplexer 108 that is configured to perform signal switching, multiplexing, and demultiplexing functions. System 100 further includes a tensor core 116 that includes the FPC 104 disposed therein. Tensor core 116 may be an example computing device configured to perform vector computations on multidimensional data arrays. Tensor core 116 may include a vector processing unit (VPU) 118, which interfaces with a matrix unit (MXU) 120, a transpose unit (XU) 122, and a reduction and permutation unit (RPU) 124. In some implementations, computing system 100 may include one or more execution units of a conventional CPU or GPU, such as load / store units, arithmetic logic units (ALUs), and vector units.
[0042] The components of system 100 collectively include a large set of hardware performance counters as well as supporting hardware that facilitates the completion of trace activity within the components. As described in greater detail below, program code executed by the respective processor cores of system 100 may include embedded triggers used to simultaneously enable multiple performance counters during code execution. In general, detected triggers cause trace data to be generated for one or more trace events. The trace data can correspond to incremental parameter counts that are stored in counters and that can be analyzed to discern performance characteristics of the program code. Data for respective trace events can be stored in an example storage medium (e.g., a hardware buffer) and can include a timestamp that is generated in response to a trigger detection.
[0043] In addition, trace data can be generated for a variety of events occurring in the hardware components of system 100. Example events can include intra-node and cross-node communication operations, such as direct memory access (DMA) operations and sync flag updates (each described in more detail below). In some implementations, system 100 may include a globally synchronous timestamp counter referred to as a global time counter ("GTC"). In other implementations, system 100 may include other types of global clocks, such as a Lamport clock.
[0044] The GTC can be used for precise correlation of program code execution and of the performance of software / program code that runs in a distributed processing environment.
Additionally, and related in part to the GTC, in some implementations, system 100 may include one or more trigger mechanisms used by distributed software programs to start and stop trace data collection in a distributed system in a highly coordinated manner.
[0045] In some implementations, a main system 126 compiles program code that may include embedded operands that, upon detection, trigger the capture and storage of trace data associated with hardware events. In some implementations, main system 126 provides the compiled program code to one or more processor chips of system 100. In alternative implementations, the program code can be compiled (with embedded triggers) by an example external compiler and loaded onto one or more processor chips of system 100. In some cases, the compiler may set one or more trace bits (discussed below) associated with certain triggers that are embedded in portions of software instructions. The compiled program code can be a distributed software program that is executed by one or more components of system 100.
[0046] The main system 126 may include a monitoring mechanism 128 configured to monitor the execution of program code by one or more components of system 100. In some implementations, the monitoring mechanism 128 allows main system 126 to monitor the execution of program code executed by at least the FPC 104 and the SPC 106. For example, during code execution, main system 126 can monitor, through the monitoring mechanism 128, the performance of the executing code at least by receiving periodic hardware event timelines based on the generated trace data. Although a single block is shown for main system 126, in some implementations, system 126 may include multiple main systems (or main subsystems) that are associated with multiple processor chips or chip cores of system 100.
[0047] In other implementations, cross-node communications involving at least three processor cores can cause main system 126 to monitor data traffic at one or more intermediate "hops" as the data traffic traverses a communication path between the FPC 104 and a third processor core / node. For example, the FPC 104 and the third processor core may be the only cores executing program code in a given period of time. In this way, a data transfer from FPC 104 to the third processor core can generate trace data for an intermediate hop at SPC 106 as the data is transferred from FPC 104 to the third processor core. Put another way, during data routing in system 100, data from a first processor chip going to a third processor chip may need to pass through a second processor chip, and thus execution of the data routing operation can cause trace entries to be generated for the routing activity on the second chip.
[0048] Upon execution of the compiled program code, the components of system 100 can interact to generate timelines of hardware events that occur in a distributed computer system. The hardware events can include intra-node and cross-node communication events. Example nodes of a distributed hardware system and their associated communications are described in more detail below with reference to Figure 2. In some implementations, a data structure is generated that identifies a collection of hardware events for at least one hardware event timeline. The timeline allows for a reconstruction of events that occur in the distributed system.
In some implementations, an event reconstruction may include correct ordering of events based on an analysis of timestamps generated during the occurrence of a particular event.
[0049] In general, an example distributed hardware tracking system may include the components of system 100 described above, as well as at least one main controller associated with a main system 126. Performance analysis or debugging of data obtained from a distributed tracking system can be useful when the event data is correlated, for example, in an ordered or time-sequenced manner. In some implementations, a data correlation can occur when multiple stored hardware events corresponding to connected software modules are sequenced for structured analysis by main system 126. For implementations including multiple main systems, a correlation of data obtained across the different hosts can be performed, for example, by the main controller.
[0050] In some implementations, the FPC 104 and the SPC 106 are each distinct cores of a multi-core processor chip, while in other implementations the FPC 104 and the SPC 106 are respective cores of distinct multi-core processor chips. As noted above, system 100 may include distributed processor units having at least an FPC 104 and an SPC 106. In some implementations, the distributed processor units of system 100 may include one or more hardware or software components configured to execute at least a portion of a larger distributed software program or program code.
[0051] The data router 112 is an inter-chip interconnect (ICI) providing data communication paths between components of system 100. In particular, the router 112 may provide communication coupling or connections between the FPC 104 and the SPC 106, and between the respective components associated with cores 104, 106. The node structure 110 interacts with the data router 112 to move data packets within the distributed hardware components and subcomponents of system 100.
[0052] Node manager 102 is a high-level device that manages low-level node functions on multi-node processor chips. As discussed in greater detail below, one or more nodes of a processor chip may include chip managers controlled by node manager 102 to manage and store hardware event data in local entry registers. Memory multiplexer 108 is a multiplexing device that can perform switching, multiplexing, and demultiplexing operations on data signals provided to an external high-bandwidth memory (HBM) or data signals received from the external HBM.
[0053] In some implementations, an example trace entry (described below) may be generated by multiplexer 108 when multiplexer 108 switches between the FPC 104 and the SPC 106. The memory multiplexer 108 can potentially impact the performance of a particular processor core 104, 106 that is not able to access the multiplexer 108. Thus, trace entry data generated by multiplexer 108 can help in understanding resulting spikes in the latencies of certain system activities associated with the respective cores 104, 106. In some implementations, hardware event data (e.g., trace points discussed below) originating at multiplexer 108 may be grouped into an example hardware event timeline together with event data for node structure 110. Such event grouping can occur when certain trace activity causes event data for multiple hardware components to be stored in an example hardware buffer (for example, trace entry register 218 discussed below).
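As a non-limiting illustration of the event reconstruction and correlation described in paragraphs [0048] and [0049], the C++ sketch below merges two per-component event buffers into a single time-ordered timeline by sorting on the event timestamps. It reuses the illustrative TraceEvent record and EarlierEvent comparator sketched above; the function name and the use of a simple sort are assumptions, not a description of the monitoring software of main system 126.

```cpp
#include <algorithm>
#include <vector>

// Sketch: reconstruct a single time-ordered timeline from two per-component
// event buffers (e.g., events captured for FPC 104 and for SPC 106). Entries
// may arrive out of order, so the merged result is sorted by timestamp.
std::vector<TraceEvent> BuildTimeline(const std::vector<TraceEvent>& first_core,
                                      const std::vector<TraceEvent>& second_core) {
  std::vector<TraceEvent> timeline;
  timeline.reserve(first_core.size() + second_core.size());
  timeline.insert(timeline.end(), first_core.begin(), first_core.end());
  timeline.insert(timeline.end(), second_core.begin(), second_core.end());
  std::stable_sort(timeline.begin(), timeline.end(), EarlierEvent);
  return timeline;
}
```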
[0054] In system 100, the performance analysis hardware comprises the FPC 104, the SPC 106, the multiplexer 108, the node structure 110, the data router 112, and the HIB 114. Each of these hardware components or units includes hardware performance counters as well as hardware event tracking facilities and functions. In some implementations, the VPU 118, the MXU 120, the XU 122, and the RPU 124 do not include their own dedicated performance hardware. Instead, in these implementations, the FPC 104 can be configured to provide the necessary counters for the VPU 118, MXU 120, XU 122, and RPU 124.
[0055] The VPU 118 may include an internal design architecture that supports localized high-bandwidth data processing and arithmetic operations associated with vector elements of an example matrix-vector processor. The MXU 120 is a matrix multiplication unit configured to perform, for example, matrix multiplications of up to 128x128 on vector data sets of multiplicands.
[0056] The XU 122 is a transpose unit configured to perform, for example, matrix transpose operations of up to 128x128 on vector data associated with matrix multiplication operations. The RPU 124 can include a sigma unit and a permutation unit. The sigma unit performs sequential reductions on vector data associated with matrix multiplication operations. The reductions can include sums and various types of comparison operations. The permutation unit can fully permute or replicate all elements of vector data associated with matrix multiplication operations.
[0057] In some implementations, program code executed by the components of system 100 may be representative of machine learning, neural network inference computations, and/or one or more direct memory access functions. The components of system 100 may be configured to execute one or more software programs including instructions that cause one or more processing units or devices of the system to execute one or more functions. The term "component" is intended to include any data processing device or storage device, such as control status registers or any other device capable of processing and storing data.
[0058] System 100 generally may include multiple processing units or devices that may include one or more processors (e.g., microprocessors or central processing units (CPUs)), graphics processing units (GPUs), application-specific integrated circuits (ASICs), or a combination of different processors. In alternative embodiments, system 100 may include other computing resources / devices (e.g., cloud-based servers) that provide additional processing options for performing computations related to the hardware tracking functions described in this specification.
[0059] The processing units or devices may further include one or more memory units or memory banks (e.g., registers / counters). In some implementations, the processing units execute programmed instructions stored in memory to cause the devices of system 100 to perform one or more functions described in this specification. The memory units / banks may include one or more non-transitory machine-readable storage media. The non-transitory machine-readable storage medium may include a solid-state memory, a magnetic disk, an optical disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (for example, an EPROM, an EEPROM, or a flash memory), or any other tangible medium capable of storing information.
[0060] Figure 2 illustrates a block diagram of trace chains and respective example nodes 200, 201 used for a distributed hardware trace performed by system 100.
In some implementations, nodes 200, 201 of system 100 may be different nodes on a single multi-core processor. In other implementations, node 200 may be a first node on a first multi-core processor chip and node 201 may be a second node on a second multi-core processor chip.
[0061] Although two nodes are described in the implementation of Figure 2, in alternative implementations, system 100 can include multiple nodes. For implementations involving multiple nodes, cross-node data transfers can generate trace data at intermediate hops along an example data path that traverses multiple nodes. For example, intermediate hops can correspond to data transfers passing through distinct nodes on a particular data transfer path. In some cases, trace data associated with ICI traces / hardware events may be generated for one or more intermediate hops that occur during cross-node data transfers that pass through one or more nodes.
[0062] In some implementations, node 0 and node 1 are tensor nodes used for vector computations associated with portions of program code for inference workloads. As used in this specification, a tensor is a multidimensional geometric object, and example multidimensional geometric objects include matrices and data arrays.
[0063] As shown in the implementation of Figure 2, node 200 includes a trace chain 203 that interacts with at least a subset of the components of system 100. Likewise, node 201 includes a trace chain 205 that interacts with at least a subset of the components of system 100. In some implementations, nodes 200, 201 are example nodes of the same subset of components, while in other implementations, nodes 200, 201 are respective nodes of distinct component subsets. The data router / ICI 112 includes a trace chain 207 that generally converges with trace chains 203 and 205 for the provision of trace data to a chip manager 216.
[0064] In the implementation of Figure 2, nodes 200, 201 may each include respective component subsets having at least the FPC 104, the SPC 106, the node structure 110, and the HIB 114. Each component of nodes 200, 201 includes one or more trace multiplexers configured to group trace points (described below) generated by a particular component of the node. The FPC 104 includes trace multiplexer 204, node structure 110 includes trace multiplexers 210a/b, the SPC 106 includes trace multiplexers 206a/b/c/d, the HIB 114 includes trace multiplexer 214, and the ICI 112 includes trace multiplexer 212. In some implementations, a trace control register for each trace multiplexer allows individual trace points to be enabled and disabled. In some cases, for one or more trace multiplexers, their trace control registers may include individual enable bits as well as broader trace multiplexer controls.
[0065] In general, the trace control registers may be conventional control status registers (CSRs) that receive and store trace instruction data. With reference to the broader trace multiplexer controls, in some implementations, tracing may be enabled and disabled based on CSR writes performed by system 100. In some implementations, tracing may be dynamically started and stopped by system 100 based on the value of a global time counter (GTC), the value of an example trace mark register in FPC 104 (or core 116), or the value of a step mark in SPC 106.
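By way of a non-limiting illustration of the trace control registers described in paragraphs [0064] and [0065], the following C++ sketch models a control status register with individual trace point enable bits, a broader multiplexer-level enable, and a dynamic start/stop check against the global time counter (GTC). The bit assignments, field names, and helper functions are assumptions made for this sketch only and do not describe a defined register map.

```cpp
#include <cstdint>

// Illustrative model only: a trace control CSR with individual trace point
// enable bits, a broad trace multiplexer enable, and GTC-based start/stop
// values. Bit positions and member names are assumptions for this sketch.
struct TraceControlCsr {
  uint32_t enables = 0;       // bit 0: multiplexer enable; bits 1..31: trace points
  uint64_t gtc_start = 0;     // tracing may start when the GTC reaches this value
  uint64_t gtc_stop = ~0ull;  // tracing may stop when the GTC reaches this value

  static constexpr uint32_t kMuxEnable = 1u << 0;

  void EnableTracePoint(unsigned point) { enables |= 1u << (point + 1); }
  void DisableTracePoint(unsigned point) { enables &= ~(1u << (point + 1)); }

  // A trace point produces a trace entry only if the multiplexer is enabled,
  // the individual enable bit is set, and the GTC value lies within the
  // dynamically configured start/stop range.
  bool TracePointActive(unsigned point, uint64_t gtc_now) const {
    const bool enabled =
        (enables & kMuxEnable) && (enables & (1u << (point + 1)));
    return enabled && gtc_now >= gtc_start && gtc_now < gtc_stop;
  }
};
```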
[0066] Details and descriptions relating to computing systems and computer-implemented methods for dynamically starting and stopping trace activity, as well as for synchronized hardware event collection, are described in related U.S. Patent Application No. 15/472,932, entitled "Synchronous Hardware Event Collection", filed March 29, 2017, and having Attorney Docket Number 16113-8129001. The entire disclosure of U.S. Patent Application No. 15/472,932 is expressly incorporated herein by reference in its entirety.
[0067] In some implementations, for core 116, the FPC 104 can use a trace control parameter to define a trace window associated with event activity occurring on core 116. The trace control parameter allows the trace window to be defined in terms of lower and upper bounds for the GTC, as well as lower and upper bounds for the trace mark register.
[0068] In some implementations, system 100 may include functions that allow a reduction in the number of trace entries that are generated, such as trace event filtering capabilities. For example, the FPC 104 and the SPC 106 may each include filtering features that limit the rate at which each core sets a trace bit in an example generated trace descriptor (described below). The HIB 114 may include similar filtering features, such as an example DMA rate limiter that limits the trace bits associated with capturing certain DMA trace events. Additionally, the HIB 114 may include controls (e.g., via an enable bit) for limiting which queues source DMA trace entries.
[0069] In some implementations, a descriptor for a DMA operation may have a trace bit that is set by an example compiler of main system 126. When the trace bit is set, hardware knobs / features that determine and generate trace data are used to complete an example trace event. In some cases, the final trace bit in a DMA can be a logical OR operation between a trace bit that is statically inserted by the compiler and a trace bit that is dynamically determined by a particular hardware component. Thus, in some cases, the compiler-generated trace bit can provide a mechanism, apart from filtering, to reduce the overall amount of trace data that is generated.
[0070] For example, a compiler of main system 126 may decide to set trace bits only for one or more remote DMA operations (for example, a DMA across at least two nodes) and clear trace bits for one or more local DMA operations (for example, a DMA within a particular tensor node, such as node 200). In this way, the amount of trace data that is generated can be reduced based on the trace activity being limited to cross-node (i.e., remote) DMA operations, rather than trace activity that includes both cross-node and local DMA operations.
[0071] In some implementations, at least one trace event initiated by system 100 may be associated with a memory access operation that includes multiple intermediate operations taking place across system 100. A descriptor (e.g., an MAE descriptor) for the memory access operation may include a trace bit that causes data associated with the multiple intermediate operations to be stored in one or more memory buffers. Thus, the trace bit can be used to "tag" buffer operations and generate multiple trace events at intermediate hops of the DMA operation as data packets traverse system 100.
[0072] In some implementations, the ICI 112 may include a set of enable bits and a set of packet filters that provide control functionality for each ingress and egress port of a particular component of node 200, 201.
The enable bits and packet filters allow the ICI 112 to enable and disable trace points associated with particular components of nodes 200, 201. In addition to enabling and disabling trace points, the ICI 112 can be configured to filter trace data based on an event source, an event destination, and a trace event packet type.
[0073] In some implementations, in addition to using step markers, the GTC, or trace markers, each trace control register for processor cores 104, 106 and HIB 114 can also include a "trace everyone" mode. This "trace everyone" mode can enable tracing across an entire processor chip to be controlled by trace multiplexer 204 or trace multiplexer 206a. While in "trace everyone" mode, trace multiplexers 204 and 206a can send an "in-window" trace control signal, which specifies whether that particular trace multiplexer, multiplexer 204 or multiplexer 206a, is in a trace window.
[0074] The in-window trace control signal can be broadcast or transmitted to all other trace multiplexers, for example, on one processor chip or across multiple processor chips. The broadcast to the other trace multiplexers can cause all tracing to be enabled when multiplexer 204 or multiplexer 206a is performing trace activity. In some implementations, the trace multiplexers associated with processor cores 104, 106 and the HIB 114 each include a trace window control register that specifies when and/or how the "trace everyone" control signal is generated.
[0075] In some implementations, trace activity in trace multiplexers 210a/b and trace multiplexer 212 is generally enabled based on whether a trace bit is set in the data words for DMA operations or control messages that traverse the data router / ICI 112. DMA operations or control messages can be fixed-size binary data structures that can have a trace bit set in the binary data packets based on certain circumstances or software conditions.
[0076] For example, when a DMA operation is initiated on FPC 104 (or SPC 106) with a trace-type DMA instruction and the initiator (processor core 104 or 106) is in a trace window, the trace bit will be set for that particular DMA. In another example, for FPC 104, control messages for data writes to another component in system 100 will have the trace bit set if the FPC 104 is in a trace window and a trace point that causes trace data to be stored is enabled.
[0077] In some implementations, zero-length DMA operations provide an example of a broader DMA implementation in system 100. For example, some DMA operations may produce non-DMA activity in system 100. The non-DMA activity can also be traced (for example, by generating trace data) as if the non-DMA activity were a DMA operation (for example, a DMA activity including non-zero-length operations). For example, a DMA operation initiated at a source location but without any data (e.g., zero length) to be sent or transferred could instead send a control message to the destination location. The control message will indicate that there is no data to be received, or to work with, at the destination, and the control message itself would be traced by system 100 just as a non-zero-length DMA operation would be traced.
[0078] In some cases, for SPC 106, zero-length DMA operations can generate a control message, and the trace bit associated with the message is set only if the DMA would have had the trace bit set, that is, had it not been of zero length.
In general, DMA operations initiated from main system 126 will have the trace bit set if the HIB 114 is in a trace window.
[0079] In the implementation of Figure 2, trace chain 203 receives trace entry data for the component subset that aligns with node 0, while trace chain 205 receives trace entry data for the component subset that aligns with node 1. Each trace chain 203, 205, 207 is a distinct data communication path used by the respective nodes 200, 201 and the ICI 112 for providing trace entry data to an example trace entry register 218 of a chip manager 216. Thus, the endpoint of trace chains 203, 205, 207 is the chip manager 216, where trace events can be stored in example memory units.
[0080] In some implementations, at least one memory unit of chip manager 216 may be 128 bits wide and may have a memory depth of at least 20,000 trace entries. In alternative implementations, at least one memory unit may have a greater or lesser bit width and may have a memory depth capable of storing more or fewer entries.
[0081] In some implementations, the chip manager 216 may include at least one processing device executing instructions for managing trace entry data. For example, the chip manager 216 can execute instructions to scan / analyze timestamp data for respective hardware events of the trace data received through trace chains 203, 205, 207. Based on the analysis, the chip manager 216 can populate trace entry register 218 to include data that can be used for identifying (or generating) a time-ordered sequence of hardware trace events. The hardware trace events can correspond to a movement of data packets occurring at the component and subcomponent level when the processing units of system 100 execute an example distributed software program.
[0082] In some implementations, hardware units of system 100 may generate trace entries (and corresponding timestamps) that fill an example hardware trace buffer in a non-time-ordered (i.e., out-of-order) manner. For example, the chip manager 216 can cause multiple trace entries, having generated timestamps, to be inserted into trace entry register 218. The respective trace entries of the inserted multiple trace entries may not be time-ordered relative to each other. In this implementation, the non-time-ordered trace entries can be received by an example main buffer of main system 126. Upon receipt by the main buffer, main system 126 can execute instructions relating to performance analysis / monitoring software to scan / analyze the timestamp data for the respective trace entries. The executed instructions can be used to sort the trace entries and to build / generate a timeline of hardware trace events.
[0083] In some implementations, trace entries may be removed from entry register 218 during a trace session via a main DMA operation. In some cases, main system 126 may not DMA entries out of trace entry register 218 as quickly as they are added to the register. In other implementations, entry register 218 may include a predefined memory depth. If the memory depth limit of entry register 218 is reached, additional trace entries may be lost. In order to control which trace entries are lost, entry register 218 can operate in a first-in-first-out (FIFO) mode, or alternatively in an overwrite recording mode.
[0084] In some implementations, the overwrite recording mode can be used by system 100 to support performance analysis associated with post-mortem debugging.
For example, program code can be executed by a certain procedure with trace activity enabled and the overwrite recording mode enabled. In response to a post-mortem software event (for example, a fatal program failure) in system 100, monitoring software run by main system 126 can analyze the data contents of an example hardware trace buffer to gain insight into the hardware events that occurred before the program's catastrophic failure. As used in this specification, post-mortem debugging refers to analyzing or debugging program code errors after the code has failed fatally or has generally failed to execute / operate as intended.
[0085] In FIFO mode, if entry register 218 is full and main system 126 does not remove saved entries within a certain time frame, then, to conserve memory resources, new trace entries may not be saved in a memory unit of chip manager 216. In overwrite recording mode, if entry register 218 is full because main system 126 does not remove saved entries within a certain time frame, then, to conserve memory resources, new trace entries may overwrite the oldest trace entry stored in entry register 218. In some implementations, trace entries are moved to a memory of main system 126 in response to a DMA operation using processing resources of the HIB 114.
[0086] As used in this specification, a trace point is the generator of a trace entry and of the data associated with the trace entry received by chip manager 216 and stored in trace entry register 218. In some implementations, a multi-node, multi-core processor microchip may include three trace chains on the chip, such that a first trace chain receives trace entries from node 0 of the chip, a second trace chain receives trace entries from node 1 of the chip, and a third trace chain receives trace entries from the ICI router of the chip.
[0087] Each trace point has a unique trace identification number in its trace chain, which it inserts into the header of the trace entry. In some implementations, each trace entry identifies the trace chain from which it originated in a header indicated by one or more bytes / bits of the data word. For example, each trace entry can include a data structure having defined field formats (e.g., header, payload, etc.) that convey information regarding a particular trace event. Each field in a trace entry corresponds to useful data applicable to the trace point that generated the trace entry.
[0088] As noted above, each trace entry may be written to, or stored in, a memory unit of chip manager 216 associated with trace entry register 218. In some implementations, trace points may be enabled or disabled individually, and multiple trace points can generate the same type of trace entry, albeit with different trace point identifiers.
[0089] In some implementations, each trace entry may include a trace name, a trace description, and a header that identifies encodings for particular fields and/or a collection of fields in the trace entry. The name, description, and header collectively provide a description of what the trace entry represents. From the perspective of chip manager 216, this description can also identify the particular trace chain 203, 205, 207 from which a specific trace entry came on a particular processor chip. Thus, the fields in a trace entry represent pieces of data (e.g., in bytes / bits) relevant to the description and can include a trace entry identifier used to determine which trace point generated a particular trace entry.
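As a non-limiting illustration of the two loss policies described for trace entry register 218 in paragraphs [0083] through [0085], the C++ sketch below models a fixed-depth entry buffer that either drops new entries when full (FIFO mode) or overwrites the oldest stored entry (overwrite recording mode). The class name, template parameter, and use of std::deque are assumptions for this sketch and do not describe the actual memory organization of chip manager 216.

```cpp
#include <cstddef>
#include <deque>

// Illustrative sketch of a fixed-depth trace entry buffer with two modes:
// FIFO mode drops new entries when the buffer is full, while overwrite mode
// discards the oldest stored entry to make room for the newest one.
template <typename Entry>
class TraceEntryBuffer {
 public:
  enum class Mode { kFifo, kOverwrite };

  TraceEntryBuffer(std::size_t depth, Mode mode) : depth_(depth), mode_(mode) {}

  // Returns false if the entry was lost because the buffer was full in FIFO mode.
  bool Push(const Entry& e) {
    if (entries_.size() < depth_) {
      entries_.push_back(e);
      return true;
    }
    if (mode_ == Mode::kOverwrite) {
      entries_.pop_front();   // discard the oldest entry
      entries_.push_back(e);  // store the newest entry
      return true;
    }
    return false;             // FIFO mode: the new entry is not saved
  }

 private:
  std::size_t depth_;
  Mode mode_;
  std::deque<Entry> entries_;
};
```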
[0090] In some implementations, the trace entry data associated with one or more of the stored hardware events may correspond, in part, to data communications that occur: a) between at least node 0 and node 1; b) between at least components within node 0; and c) between at least components within node 1. For example, the stored hardware events may correspond, in part, to data communications that occur between at least one of: 1) the FPC 104 of node 0 and the FPC 104 of node 1; 2) the FPC 104 of node 0 and the SPC 106 of node 0; and 3) the FPC 104 of node 1 and the SPC 106 of node 1.
[0091] Figure 3 illustrates a block diagram of an example trace multiplexer design architecture 300 and an example data structure 320. The trace multiplexer design 300 generally includes a trace bus input 302, a bus arbiter 304, a local trace point arbiter 306, a bus FIFO 308, at least one local trace event queue 310, a shared trace event FIFO 312, and a trace bus output 314.
[0092] Multiplexer design 300 corresponds to an example trace multiplexer disposed in a component of system 100. Multiplexer design 300 may include the following functionality. Bus input 302 may relate to trace point data that is temporarily stored in bus FIFO 308 until arbitration logic (e.g., arbiter 304) can cause the trace data to be placed onto an example trace chain. One or more trace points for a component can insert trace event data into at least one local trace event queue 310. Arbiter 306 provides first-level arbitration and enables a selection of events from among the local trace events in queue 310. The selected events are placed into a shared trace event FIFO 312, which acts as a storage queue.
[0093] Arbiter 304 provides second-level arbitration that receives local trace events from the FIFO queue 312 and merges the local trace events onto a particular trace chain 203, 205, 207 via trace bus output 314. In some implementations, trace entries can be pushed to local queues 310 faster than they can be merged into shared FIFO 312, or alternatively, trace entries can be pushed to shared FIFO 312 faster than they can be merged onto trace bus 314. When these scenarios occur, the respective queues 310 and 312 will become filled with trace data.
[0094] In some implementations, when queue 310 or 312 becomes filled with trace data, system 100 can be configured so that new trace entries are dropped and not stored in, or merged into, a particular queue. In other implementations, rather than dropping trace entries when certain queues fill up (e.g., queues 310, 312), system 100 can be configured to stall an example processing thread until the queues that are full once again have queue space available for receiving entries.
[0095] For example, a processing thread that uses queues 310, 312 can be stalled until a sufficient number, or threshold, of trace entries is merged onto trace bus 314. The sufficient number or threshold can correspond to a particular number of merged trace entries that results in queue space becoming available for one or more trace entries to be received by queues 310, 312. Implementations in which processing threads are stalled until downstream queue space becomes available can provide higher-fidelity trace data, based on certain trace entries being retained rather than dropped.
[0096] In some implementations, the local trace queues are as wide as required by the trace entry, such that each trace entry takes only one slot in local queue 310.
However, the shared trace event FIFO 312 can use a single-line trace entry encoding, such that some trace entries can occupy two slots in shared queue 312. In some implementations, when any data of a trace packet is dropped, the entire packet is dropped so that no partial packets appear in trace entry register 218.
[0097] In general, a trace is a timeline of activities or hardware events associated with a particular component of system 100. Unlike performance counters, which hold aggregated data, traces contain detailed event data that provides insight into hardware activity occurring during a specified trace window. The described hardware system allows extensive support for distributed hardware tracing, including generation of trace entries, temporary storage of trace entries in a managed hardware buffer, static and dynamic enabling of one or more trace types, and continuous transmission of trace entry data to main system 126.
[0098] In some implementations, traces may be generated for hardware events executed by components of system 100, such as generating a DMA operation, executing a DMA operation, issuing / executing certain instructions, or updating sync flags. In some cases, trace activity can be used to trace DMAs across the system, or to follow instructions executing on a particular processor core.
[0099] System 100 may be configured to generate at least one data structure 320 that identifies one or more hardware events 322, 324 from a hardware event timeline. In some implementations, data structure 320 arranges the one or more hardware events 322, 324 into a time-ordered sequence of events that are associated with at least the FPC 104 and the SPC 106. In some cases, system 100 may store data structure 320 in a memory bank of a main control device of main system 126. Data structure 320 may be used for evaluating the performance of program code executed by at least processor cores 104 and 106.
[0100] As shown by hardware events 324, in some implementations, a particular trace identification (ID) number (e.g., trace ID '003) can be associated with multiple hardware events that occur across the distributed processor units. The multiple hardware events can correspond to a particular memory access operation (e.g., a DMA), and the particular trace ID number is used to correlate the one or more hardware events.
[0101] For example, as indicated by event 324, a single trace ID for a DMA operation may include multiple time steps corresponding to multiple distinct points in the DMA. In some cases, trace ID '003 may have an "issued" event, an "executed" event, and a "completed" event that are identified as being spaced some time apart from each other. Thus, in this sense, the trace ID can additionally be used for determining a latency attribute of the memory access operation based on the correlation and with reference to the time intervals.
[0102] In some implementations, the generation of data structure 320 may include, for example, system 100 comparing event timestamps of respective events in a first subset of hardware events with event timestamps of respective events in a second subset of hardware events. The generation of data structure 320 may further include system 100 providing, for presentation in the data structure, a correlated set of hardware events based, in part, on the comparison between the first subset of events and the second subset of events.
[0103] As shown in Figure 3, data structure 320 can identify at least one parameter that indicates a latency attribute of a particular hardware event 322, 324, the latency attribute indicating at least a duration of the particular hardware event. In some implementations, data structure 320 is generated by software instructions executed by a control device of main system 126. In some cases, structure 320 may be generated in response to the control device storing trace entry data to a disk / memory unit of main system 126.
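By way of a non-limiting illustration of the correlation described in paragraphs [0100] and [0101], the C++ sketch below groups events that share a trace ID (for example, the "issued", "executed", and "completed" points of a single DMA) and derives a latency attribute from the span between the earliest and latest timestamps. It reuses the illustrative TraceEvent record sketched earlier; the function is an assumption for illustration and is not the logic that generates data structure 320.

```cpp
#include <cstdint>
#include <vector>

// Sketch: compute a latency attribute for one correlated hardware event by
// scanning the time-ordered timeline for entries that share a trace ID and
// taking the span between the earliest and latest timestamps found.
uint64_t LatencyForTraceId(const std::vector<TraceEvent>& timeline,
                           uint32_t trace_id) {
  uint64_t first = UINT64_MAX;
  uint64_t last = 0;
  for (const TraceEvent& e : timeline) {
    if (e.trace_id != trace_id) continue;
    if (e.timestamp < first) first = e.timestamp;
    if (e.timestamp > last) last = e.timestamp;
  }
  return (first == UINT64_MAX) ? 0 : (last - first);  // duration of the event
}
```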
[0103] As shown in Figure 3, data structure 320 can identify at least one parameter that indicates a latency attribute of a particular hardware event 322, 324. hardware in particular. In some implementations, data structure 320 is generated by software instructions executed by a main system control device 126. In some cases, structure 320 may be generated in response to the control device storing tracking input data for a disk / a main system memory unit 126. [0104] Figure 4 is a block diagram 400 that indicates an example trace activity for a direct memory access (DMA) trace event performed by system 100. For a DMA trace, the data for a DMA operation. Example DMA originating from a first processor node to a second processor node may travel through ICI 112 and may generate intermediate ICI/router hops along the data path. The DMA operation will generate trace entries at each node on a processor chip, and along each hop, as the DMA operation traverses ICI 112. Information is captured by each of these generated trace entries for the reconstruction of a temporal progression of DMA operations along nodes and hops. [0105] An example DMA operation can be associated with the process steps described in the implementation of Figure 4. For this operation, a local DMA transfers data from a virtual memory 402 (vmem 402) associated with at least one of the processor cores 104, 106 to HBM 108. The numbering depicted in diagram 400 corresponds to the steps in table 404 and generally represents activities in node structure 110 or activities initiated by node structure 110. [0106] The steps in table 404 generally describe associated tracking points. The example operation will generate six trace entries for this DMA. Step one includes the initial DMA request from the processor core to the node structure 110, which generates a tracking point in the node structure. Step two includes a read command where node structure 110 asks the processor core to transfer data that generates another tracking point in node structure 110. The example operation does not have a tracking input for step three when vmem 402 completes a node structure read 110. [0107] Step four includes node structure 110 performing a read resource update to do a sync flag update on the processor core that generates a trace point on the processor core. Step five includes a write command in which the node structure 110 notifies the memory multiplexer 108 of data to be written to the HBM. Notification via the write command generates a tracepoint in the node structure 110, while, in step six, the completion of writing to the HBM also generates a tracepoint in the node structure 110. In step seven, the node structure 110 performs a write resource update to cause a sync flag update on the processor core, which generates a tracepoint on the processor core (for example, on the FPC 104). In addition to the write resource update, the node structure 110 may perform an acknowledge update ("ACK update"), in which a data completion for the DMA operation is signaled back to the processor core. The ACK update can generate trace entries that are similar to the trace entries generated by the write resource update. [0108] In another example DMA operation, a first trace input is generated when a DMA instruction is issued in a node structure 110 of the originating node. Additional tracking entries can be generated in the node structure 110 to capture the time used to read data to the DMA and write the data to outgoing queues. 
In some implementations, node structure 110 can pack DMA data into smaller chunks of data. For data packed into smaller chunks, read and write trace entries can be produced for a first chunk of data and a last chunk of data. Optionally, in addition to the first and last data chunks, all data chunks can be configured to generate trace entries. [0109] For remote/non-local DMA operations that might require ICI hops, the first data chunk and the last data chunk can generate additional trace entries at ingress and egress points at each intermediate hop along the ICI/router 112. When the DMA data arrives at a destination node, trace entries similar to the previous node structure 110 entries (e.g., read/write of the first and last data chunks) are generated at the destination node. In some implementations, a final step of the DMA operation may include executed instructions associated with the DMA causing an update to a sync flag on the target node. When the sync flag is updated, a trace entry may be generated indicating the completion of the DMA operation. [0110] In some implementations, a DMA trace is initiated by FPC 104, SPC 106, or HIB 114 when each component is in a trace mode so that trace points can be executed. Components of system 100 can enter trace mode based on global controls on the FPC 104 or SPC 106 through a trigger mechanism. Trace points fire in response to the occurrence of a specific action or condition associated with execution of program code by the components of system 100. For example, portions of program code may include built-in trigger functions that are detectable by at least one hardware component of system 100. [0111] The components of system 100 can be configured to detect a trigger function associated with portions of program code executed by at least one of the FPC 104 or the SPC 106. In some cases, the trigger function may correspond to at least one of: 1) a particular sequence step in a portion or module of the executed program code; or 2) a particular time parameter indicated by the GTC used by the distributed processor units of system 100. [0112] In response to detection of the trigger function, a particular component of system 100 may initiate, trigger, or execute at least one trace point (e.g., a trace event) that causes trace entry data associated with one or more hardware events to be stored in at least one memory buffer of a hardware component. As noted above, the stored trace data can then be provided to the chip manager 216 via at least one trace chain 203, 205, 207. [0113] Figure 5 is a process flowchart of an example process 500 for distributed hardware tracing using resources of the components of system 100 and one or more nodes 200, 201 of system 100. Thus, process 500 can be implemented using one or more of the above-mentioned computing resources of system 100, including nodes 200, 201. [0114] Process 500 begins at block 502 and includes computer system 100 monitoring the execution of program code executed by one or more processor components (including at least the FPC 104 and the SPC 106). In some implementations, the execution of program code that generates tracing activity can be monitored, at least in part, by multiple main systems or by subsystems of a single main system. Thus, in these implementations, system 100 can run multiple processes 500 with respect to an analysis of tracing activity for hardware events occurring across the distributed processing units. 
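A minimal sketch of the trigger mechanism described above, assuming illustrative names (TraceController, maybe_fire) and a simple in-memory buffer rather than any hardware-specific structure: a component compares the current sequence step or GTC value against a configured trigger and, once the trigger fires, begins recording trace entries in its local memory buffer:

```python
import collections

class TraceController:
    """Illustrative model of per-component trace gating via a trigger function."""

    def __init__(self, trigger_step=None, trigger_gtc=None, capacity=256):
        self.trigger_step = trigger_step   # particular sequence step, or None
        self.trigger_gtc = trigger_gtc     # particular GTC time parameter, or None
        self.tracing = False
        self.buffer = collections.deque(maxlen=capacity)  # local memory buffer

    def maybe_fire(self, current_step, current_gtc):
        """Enter trace mode when either trigger condition is satisfied."""
        if (self.trigger_step is not None and current_step == self.trigger_step) or \
           (self.trigger_gtc is not None and current_gtc >= self.trigger_gtc):
            self.tracing = True

    def record(self, event):
        """Store a trace entry only while the component is in trace mode."""
        if self.tracing:
            self.buffer.append(event)

controller = TraceController(trigger_step=42)
controller.maybe_fire(current_step=42, current_gtc=1000)
controller.record({'ts': 1001, 'trace_id': '003', 'op': 'DMA issued'})
print(list(controller.buffer))
```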
[0115] In some implementations, a first processor component is configured to execute at least a first portion of the program code that is monitored. At block 504, process 500 includes computer system 100 monitoring the execution of program code executed by a second processor component. In some implementations, the second processor component is configured to execute at least a second portion of the program code that is monitored. [0116] The components of computing system 100 may each include at least one memory buffer. Block 506 of process 500 includes system 100 storing data identifying one or more hardware events in at least one memory buffer of a particular component. In some implementations, the hardware events occur across distributed processor units that include at least a first processor component and a second processor component. The stored data identifying the hardware events can each include a hardware event timestamp and metadata characterizing the hardware event. In some implementations, the stored events form a collection of events on a hardware event timeline. [0117] For example, system 100 may store data identifying one or more hardware events that correspond, in part, to a movement of data packets between a source hardware component in system 100 and a destination hardware component in system 100. In some implementations, the stored metadata characterizing the hardware event may correspond to at least one of: 1) a source memory address, 2) a destination memory address, 3) a unique tracking identification number with respect to the trace entry that causes the hardware event to be stored, or 4) a size parameter associated with a direct memory access (DMA) trace entry. [0118] In some implementations, storing data identifying a collection of hardware events includes storing event data in an FPC 104 and/or SPC 106 memory buffer that corresponds, for example, to at least one trace event data queue 310. The stored event data can indicate subsets of hardware event data that can be used for generating a larger hardware event timeline. In some implementations, event data storage occurs in response to at least one of the FPC 104 or the SPC 106 executing hardware trace instructions associated with portions of a program code executed by components of system 100. [0119] In block 508 of process 500, system 100 generates a data structure, such as structure 320, that identifies one or more hardware events from the collection of hardware events. The data structure may arrange the one or more hardware events in a time-ordered sequence of events that are associated with at least one of the first processor component and the second processor component. In some implementations, the data structure identifies a hardware event timestamp for a particular trace event, a source address associated with the trace event, or a memory address associated with the trace event. [0120] In block 510 of process 500, system 100 stores the generated data structure in a memory bank of a main device associated with main system 126. In some implementations, the stored data structure may be used by main system 126 for analyzing the performance of program code executed by at least one of the first processor component or the second processor component. Likewise, the stored data structure can be used by main system 126 for analyzing the performance of at least one component of system 100. 
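For concreteness, a minimal sketch of a stored trace record carrying the metadata fields enumerated in paragraph [0117]; the field names are assumptions made purely for illustration, since the descriptive report only requires a hardware event timestamp plus metadata characterizing the event:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class DmaTraceRecord:
    timestamp: int    # hardware event timestamp
    trace_id: str     # unique tracking identification number
    src_addr: int     # source memory address
    dst_addr: int     # destination memory address
    size_bytes: int   # size parameter associated with the DMA trace entry
    component: str    # component whose memory buffer held the entry, e.g. 'FPC'

record = DmaTraceRecord(timestamp=1024, trace_id='003',
                        src_addr=0x40000000, dst_addr=0x80000000,
                        size_bytes=4096, component='FPC')
# Records like this one could be streamed to the main system and later merged
# into the time-ordered data structure for analysis.
print(json.dumps(asdict(record)))
```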
[0121] For example, the user or main system 126 can analyze the data structure to detect or determine whether there is a resource problem associated with execution of a particular software module of the program code. An example problem might include the software module not completing execution within an allocated runtime window. [0122] Further, the user or main system 126 can detect or determine whether a particular component of system 100 is operating above or below a threshold performance level. An example problem relating to component performance might include a particular hardware component executing certain events but generating result data that is outside acceptable parameter ranges for result data. In some implementations, the result data may not be consistent with the result data generated by other related components of system 100 that perform substantially similar operations. [0123] For example, during program code execution, a first component of system 100 may be required to complete an operation and generate a result. Likewise, a second component of system 100 may be required to complete a substantially similar operation and generate a substantially similar result. An analysis of the generated data structure might indicate that the second component generated a result that is drastically different from the result generated by the first component. Likewise, the data structure may indicate a result parameter value for the second component that is noticeably outside the range of acceptable result parameters. These results could indicate a potential performance issue with the second component of system 100. [0124] The subject modalities and functional operations described in this descriptive report can be implemented in a digital electronic circuit, in computer software or firmware embodied in tangible form, in computer hardware, including the structures disclosed in this descriptive report and their structural equivalents, or in combinations of one or more of them. The subject modalities described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded in a tangible non-transient program carrier for execution by, or for control of the operation of, a data processing apparatus. Alternatively or additionally, the program instructions may be encoded in an artificially generated propagated signal, for example, a machine-generated electrical, optical or electromagnetic signal, which is generated to encode information for transmission to a suitable receiving apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a storage substrate that is machine-readable, a serial or random-access memory device, or a combination of one or more of them. [0125] The logical processes and flows described in this descriptive report may be executed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output(s). The logical processes and flows can also be performed by, and an apparatus can also be implemented as, a special purpose logic circuit, for example, an FPGA (field programmable gate array), an ASIC (application specific integrated circuit), or a GPGPU (general purpose graphics processing unit). 
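As an aside illustrating the analysis described in paragraphs [0121] to [0123] above (the thresholds, component names, and result parameters are purely hypothetical), result parameters extracted from the generated data structure could be screened for values that fall outside an acceptable range:

```python
def find_outliers(results, low, high):
    """Return components whose result parameter lies outside [low, high]."""
    return {name: value for name, value in results.items()
            if not (low <= value <= high)}

# Hypothetical result parameters extracted from the analyzed data structure,
# e.g. per-operation latency in microseconds for two similar components.
results = {'component_1': 98.7, 'component_2': 143.2}
acceptable_low, acceptable_high = 90.0, 110.0

flagged = find_outliers(results, acceptable_low, acceptable_high)
if flagged:
    print('potential performance issue:', flagged)
```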
[0126] Computers suitable for the execution of a computer program may, by way of example, be based on general or special purpose microprocessors, or both, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from read-only memory or random access memory or both. The essential elements of a computer are a central processing unit for carrying out or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for data storage, for example, magnetic, magneto-optical or optical disks. However, a computer need not have these devices. [0127] Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, memory media and devices, including, by way of example, semiconductor memory devices, for example, EPROM, EEPROM and flash memory devices; and magnetic disks, for example, internal hard disks or removable disks. The processor and the memory can be supplemented by, or incorporated into, a special-purpose logic circuit. [0128] Although this descriptive report contains many implementation-specific details, these should not be construed as limitations on the scope of any invention or of what can be claimed, but rather as descriptions of features that may be specific to particular modalities of particular inventions. Certain features that are described in this descriptive report in the context of separate modalities can also be implemented in combination in a single modality. Conversely, multiple features that are described in the context of a single modality can also be implemented in multiple modalities separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features of a claimed combination may in some cases be removed from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination. [0129] Similarly, although operations are depicted in the drawings in a particular order, this is not to be understood as requiring that these operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to obtain the desired results. In certain circumstances, multitasking execution or parallel processing can be advantageous. Furthermore, the separation of various system modules and components in the modalities described above is not to be understood as requiring such separation in all modalities, and it should be understood that the described program components and systems can generally be integrated together in a single software product or bundled into multiple software products. [0130] Particular modalities of the subject matter have been described. Other modalities are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As an example, the processes described in the associated Figures do not necessarily require the particular order shown, or sequential order, to obtain desirable results. In certain implementations, multitasking or parallel processing can be advantageous.
Claims (19) [0001] 1. Method implemented on a computer executed by a distributed hardware tracking system, the method characterized by the fact that it comprises: monitoring the execution of an instruction set by a first multi-core processor component in the distributed hardware tracking system, the first multi-core processor component being configured to execute at least a first portion of the instruction set, and wherein the distributed hardware tracking system includes multiple component nodes, each component node corresponding to a particular hardware component of the distributed hardware tracking system; monitoring the execution of the instruction set by a second multi-core processor component in the distributed hardware tracking system, the second multi-core processor component being configured to execute at least a second portion of the instruction set; the storage, by the distributed hardware tracking system, of data identifying one or more hardware events occurring through the distributed hardware tracking system, which includes the first multi-core processor component and the second multi-core processor component, each hardware event representing at least one of: i) data communications regarding a memory access operation in which data is routed at least between respective component nodes of the distributed hardware tracking system during execution of the instruction set, or ii) a time status and an execution status of an instruction included in the instruction set, wherein the data identifying each of the one or more hardware events comprise a hardware event timestamp and metadata characterizing the hardware event; the generation, by the distributed hardware tracking system, of a data structure that identifies one or more hardware events, the data structure being configured to arrange the one or more hardware events into a time-ordered sequence of events that occurred during execution of the instruction set by at least the first multi-core processor component and the second multi-core processor component, wherein the data structure comprises a particular tracking identification (ID) number that is associated with multiple hardware events, corresponding to a particular memory access operation, that occur through the distributed hardware tracking system, and wherein the particular tracking identification (ID) number is used for correlation of one or more hardware events of the multiple hardware events; and the storage, by the distributed hardware tracking system, of the data structure generated in a memory bank of a main device. [0002] 2. Method according to claim 1, characterized in that it further comprises: the detection, by the distributed hardware tracking system, of a trigger function associated with portions of the instruction set being executed by at least one of the first multi-core processor component or the second multi-core processor component; and, in response to detection of the trigger function, the initiation, by the distributed hardware tracking system, of at least one tracking event which causes data associated with one or more hardware events to be stored in at least one memory buffer. [0003] 3. 
Method according to claim 2, characterized in that the trigger function corresponds to a particular sequence step in the instruction set or a particular time parameter indicated by a global time clock used by the distributed hardware tracking system; and the initiation of the at least one trace event comprises determining that a trace bit is set to a particular value, the at least one trace event being associated with a memory access operation including multiple intermediate operations that occur through the distributed hardware tracking system, wherein data associated with the multiple intermediate operations are stored in one or more memory buffers in response to a determination that the trace bit is set to the particular value. [0004] 4. Method according to claim 1, characterized in that storing data identifying one or more hardware events further comprises: the storing, in a first memory buffer of the first multi-core processor component, of a first subset of data identifying hardware events of the one or more hardware events, wherein the storing occurs in response to the first multi-core processor component executing a hardware trace instruction associated with at least the first portion of the instruction set. [0005] 5. Method according to claim 4, characterized in that storing data identifying one or more hardware events further comprises: the storing, in a second memory buffer of the second multi-core processor component, of a second subset of data identifying hardware events of the one or more hardware events, wherein the storing occurs in response to the second multi-core processor component executing a hardware trace instruction associated with at least the second portion of the instruction set. [0006] 6. Method according to claim 5, characterized in that the generation of the data structure further comprises: the comparison, by the distributed hardware tracking system, of at least hardware event timestamps of respective events in the first subset of data identifying hardware events with at least hardware event timestamps of respective events in the second subset of data identifying hardware events; and the provision, by the distributed hardware tracking system and for presentation in the data structure, of a correlated set of hardware events based, in part, on the comparison between the respective events in the first subset and the respective events in the second subset. [0007] 7. Method according to claim 1, characterized in that the generated data structure identifies at least one parameter that indicates a latency attribute of a particular hardware event or the time status of the instruction in the instruction set, the latency attribute indicating at least a duration of the particular hardware event. [0008] 8. Method according to claim 1, characterized in that at least one processor of the distributed hardware tracking system is a multi-node, multi-core processor having one or more multi-core processor components, and one or more hardware events correspond, in part, to data transfers that occur between at least the first multi-core processor component of a first node and the second multi-core processor component of a second node. [0009] 9. 
Method according to claim 1, characterized in that: each of the first multi-core processor component and the second multi-core processor component is one of: a multi-core processor, a multi-node processor, a memory access mechanism, or a hardware component of the distributed hardware tracking system, and wherein one or more hardware events correspond, in part, to a movement of data packets between a source and a destination; and wherein the data characterizing the hardware event correspond to at least one of a source memory address, a destination memory address, a unique trace identification number, or a size parameter associated with a direct memory access (DMA) trace. [0010] 10. Distributed hardware tracking system, characterized in that it comprises: one or more multi-core processor components; one or more non-transient machine-readable storage units for storing instructions that are executable by the one or more processor components to cause the performance of operations comprising: monitoring the execution of an instruction set by a first multi-core processor component in the distributed hardware tracking system, the first multi-core processor component being configured to execute at least a first portion of the instruction set, and wherein the distributed hardware tracking system includes multiple component nodes, each component node corresponding to a particular hardware component of the distributed hardware tracking system; monitoring the execution of the instruction set by a second multi-core processor component in the distributed hardware tracking system, the second multi-core processor component being configured to execute at least a second portion of the instruction set; the storage, by a computing system including the distributed hardware tracking system, of data identifying one or more hardware events occurring through the distributed hardware tracking system that includes the first multi-core processor component and the second multi-core processor component, each hardware event representing at least one of: i) data communications regarding a memory access operation in which data is routed at least between respective component nodes of the distributed hardware tracking system during execution of the instruction set, or ii) a time status and an execution status of an instruction included in the instruction set, wherein the data identifying each of the one or more hardware events comprise a hardware event timestamp and metadata characterizing the hardware event; the generation, by the computer system, of a data structure that identifies one or more hardware events, the data structure being configured to arrange the one or more hardware events in a time-ordered sequence of events that occurred during execution of the instruction set by at least the first multi-core processor component and the second multi-core processor component, wherein the data structure comprises a particular tracking identification (ID) number that is associated with multiple hardware events, corresponding to a particular memory access operation, which occur through the distributed hardware tracking system, and wherein the particular tracking identification (ID) number is used for correlation of one or more hardware events of the multiple hardware events; and the storage, by the computing system, of the data structure generated in a memory bank of a main device. [0011] 11. 
Distributed hardware tracking system according to claim 10, characterized in that the operations further comprise: the detection, by the computer system, of a trigger function associated with portions of the instruction set being executed by at least one of the first multi-core processor component or the second multi-core processor component; and, in response to detection of the trigger function, the initiation, by the computing system, of at least one tracking event that causes data associated with one or more hardware events to be stored in at least one memory buffer. [0012] 12. Distributed hardware tracking system according to claim 11, characterized in that the trigger function corresponds to a particular sequence step in the instruction set or a particular time parameter indicated by a global time clock used by the distributed hardware tracking system; wherein initiating the at least one tracking event comprises determining that a trace bit is set to a particular value, the at least one tracking event being associated with a memory access operation including multiple intermediate operations that take place through the distributed hardware tracking system; and wherein data associated with the multiple intermediate operations are stored in one or more memory buffers in response to a determination that the trace bit is set to the particular value. [0013] 13. Distributed hardware tracking system according to claim 10, characterized in that storing data identifying one or more hardware events further comprises: the storing, in a first memory buffer of the first multi-core processor component, of a first subset of data identifying hardware events of the one or more hardware events, wherein the storing occurs in response to the first multi-core processor component executing a hardware trace instruction associated with at least the first portion of the instruction set. [0014] 14. Distributed hardware tracking system according to claim 13, characterized in that storing data identifying one or more hardware events further comprises: the storing, in a second memory buffer of the second multi-core processor component, of a second subset of data identifying hardware events of the one or more hardware events, wherein the storing occurs in response to the second multi-core processor component executing a hardware trace instruction associated with at least the second portion of the instruction set. [0015] 15. Distributed hardware tracking system according to claim 14, characterized in that the generation of the data structure further comprises: the comparison, by the computer system, of at least hardware event timestamps of respective events in the first subset of data identifying hardware events with at least hardware event timestamps of respective events in the second subset of data identifying hardware events; and the provision, by the computing system and for presentation in the data structure, of a correlated set of hardware events based, in part, on the comparison between the respective events in the first subset and the respective events in the second subset. [0016] 16. Distributed hardware tracking system according to claim 10, characterized in that the generated data structure identifies at least one parameter that indicates a latency attribute of a particular hardware event or the time status of the instruction in the instruction set, the latency attribute indicating at least a duration of the particular hardware event. [0017] 17. 
Distributed hardware tracking system according to claim 10, characterized in that at least one processor of the computing system is a multi-node, multi-core processor having one or more multi-core processor components, and one or more hardware events correspond, in part, to data communications that occur between at least the first multi-core processor component of a first node and the second multi-core processor component of a second node. [0018] 18. Distributed hardware tracking system according to claim 10, characterized in that: each of the first multi-core processor component and the second multi-core processor component is one of: a multi-core processor, a multi-node processor, a memory access mechanism, or a hardware component of the computer system; wherein one or more hardware events correspond, in part, to a movement of data packets between a source and a destination; and wherein the data characterizing the hardware event correspond to at least one of a source memory address, a destination memory address, a unique tracking identification (ID) number, or a size parameter associated with a direct memory access (DMA) trace. [0019] 19. Non-transient machine-readable storage unit for storing instructions, characterized by the fact that the instructions are executable by one or more multi-core processor components to cause the performance of operations comprising: monitoring the execution of an instruction set by a first multi-core processor component in a distributed hardware tracking system, the first multi-core processor component being configured to execute at least a first portion of the instruction set, and wherein the distributed hardware tracking system includes multiple component nodes, each component node corresponding to a particular hardware component of the distributed hardware tracking system; monitoring the execution of the instruction set by a second multi-core processor component in the distributed hardware tracking system, the second multi-core processor component being configured to execute at least a second portion of the instruction set; the storage, by a computing system that includes the distributed hardware tracking system, of data identifying one or more hardware events occurring through the distributed hardware tracking system that includes the first multi-core processor component and the second multi-core processor component, each hardware event representing at least one of: i) data communications regarding a memory access operation in which data is routed at least between respective component nodes of the distributed hardware tracking system during execution of the instruction set, or ii) a time status and an execution status of an instruction included in the instruction set, wherein the data identifying each of the one or more hardware events comprise a hardware event timestamp and metadata characterizing the hardware event; the generation, by the computer system, of a data structure that identifies one or more hardware events, the data structure being configured to arrange the one or more hardware events into a time-ordered sequence of events that occurred during execution of the instruction set by at least the first multi-core processor component and the second multi-core processor component, wherein the data structure comprises a particular tracking identification (ID) number that is associated with multiple hardware events, corresponding to a particular memory access operation, occurring through the distributed hardware tracking system, and wherein the particular tracking identification (ID) number is used for correlation of one or more hardware events of the multiple hardware events; and the storage, by the computing system, of the data structure generated in a memory bank of a main device.